Introduction

This file contains some simple analysis of the SPOOKY data. The goal is to remind ourselves of some of our basic tools for working with text data in R and also to practice reproducibility. You should be able to put this file in the doc folder of your Project 1 repository and it should just run (provided you have multiplot.R in the libs folder and spooky.csv in the data folder). If you open the file from a forked Week1-GitHub repo, you should have no trouble running the code directly.

Set up the libraries

The following chunk shows the libraries necessary for conducting the text analysis for this project.

packages.used <- c("ggplot2", "dplyr", "tidytext", "wordcloud", "stringr", "ggridges", "tidyr", "RColorBrewer", "psych")

# check packages that need to be installed.
packages.needed <- setdiff(packages.used, intersect(installed.packages()[,1], packages.used))

# install additional packages
if(length(packages.needed) > 0) {
  install.packages(packages.needed, dependencies = TRUE, repos = 'http://cran.us.r-project.org')
}

library(ggplot2)
library(dplyr)
library(tidytext)
library(wordcloud)
library(stringr)
library(ggridges)
library(tidyr)
library(RColorBrewer)
library(psych)

source("../libs/multiplot.R")

Read in the data

spooky <- read.csv('../data/spooky.csv', as.is = TRUE)

Descriptive Statistics

Before diving in, let’s see what the structure of the data is.

head(spooky)
summary(spooky)
##       id                text              author         
##  Length:19579       Length:19579       Length:19579      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Spooky is a data frame containing columns for the identification of the sentence (id), the sentence itself (text), and the author of the sentence (author). HPL is HP Lovecraft, MWS is Mary Shelley, and EAP is Edgar Allan Poe. Next, we look at some sample sentences to get a feel for the text before analyzing it.

spooky$text[1]
## [1] "This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall."
spooky$text[19579]
## [1] "He laid a gnarled claw on my shoulder, and it seemed to me that its shaking was not altogether that of mirth."
spooky$text[7500]
## [1] "I screamed and struggled, and after a blankness was again in my attic room, sprawled flat over the five phosphorescent circles on the floor."

These sentences allude to the “spooky” aspects of the horror fiction with words like “dungeon”, “gnarled claw”, and “screamed.”

How many sentences (text column) belong to each author?

From the above figure, we can see that Edgar Allan Poe has the most entries in the corpus at 7,900 sentences, followed by Mary Shelley at 6,044 sentences, and HP Lovecraft at 5,635 sentences. The distribution among authors is even enough that we can evaluate the texts of the authors and begin to understand the particular style and characteristics of each author.
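
The code behind the per-author counts is not shown here, so the following is only a minimal sketch of how such counts could be produced. It uses a small made-up `author` vector rather than the real data; on the actual data frame the same idea would be `table(spooky$author)`.

```r
# Toy stand-in for the `author` column of spooky.csv (hypothetical values)
author <- c("EAP", "HPL", "EAP", "MWS", "EAP", "MWS")

# Number of sentences per author, largest first
author_counts <- sort(table(author), decreasing = TRUE)
print(author_counts)

# Share of the corpus contributed by each author
author_share <- round(prop.table(author_counts), 2)
print(author_share)
```

Looking at shares as well as raw counts makes it easier to judge whether the corpus is balanced enough to compare authors directly.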

Sentence Length

spooky$len <- str_length(spooky$text)

#png("../figs/median_sentence.png")
spooky %>%
  group_by(author) %>%
  summarise(CountMedian = median(len,na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(author = reorder(author,CountMedian)) %>%
  
  ggplot(aes(x = author,y = CountMedian)) +
  geom_bar(stat='identity',colour="white", fill = "lightgoldenrod3") +
  labs(x = 'Author', 
       y = 'Sentence Length', 
       title = 'Author and Median Sentence Length') +
  coord_flip() + 
  theme_bw()

#dev.off()

By looking at the median sentence length, we see that HP Lovecraft tends to have longer sentences than the other two authors. Edgar Allan Poe has a median sentence that is about 20-30 characters shorter than Lovecraft’s. Mary Shelley is in between the two authors in terms of median length.

#png("../figs/wordlength.png")
spooky %>%
      ggplot(aes(x = len, fill = author)) +    
      geom_histogram() +
      scale_x_continuous(limits = c(15,100)) +
      scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
      facet_wrap(~author) +
      labs(x = 'Sentence Length (characters)', y = 'Count', title = 'Distribution of Sentence Length') +
      theme_bw()

#dev.off()

Next, we look at the distribution of sentence length, measured in characters, for each author. To make the figure easier to read, we limit the x-axis to sentences of 15 to 100 characters. From the above figure, we see that Lovecraft and Shelley tend to have sentences between 75 and 100 characters long, whereas Poe’s sentences cover a wide range of lengths with a dip around 65. Let’s now compute some simple numerical summaries of the data and build visualizations that help us understand the similarities between authors.

Data Cleaning

Because of the natural structure of language, texts include words that do not provide much insight into the content of each sentence. These words are mostly articles, conjunctions, and prepositions. To address them, I use the “stop_words” lexicon to remove the words that, in theory, do not contribute much to each sentence. Before that, unnest_tokens() splits each sentence into words, strips punctuation, and converts everything to lowercase. This is important for textual analysis because we want to treat “Ghosts,” and “ghosts” as the same word. Note that no stemming is applied, so forms such as “ghost” and “ghost’s” are still counted separately.

spooky_word <- unnest_tokens(spooky, word, text)
head(spooky_word)
head(stop_words)
tail(stop_words)
spooky_word <- anti_join(spooky_word, stop_words, by = "word")
head(spooky_word)
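
To see what the tokenizing and stop-word steps above do to a single sentence, here is a rough base R approximation. The sentence and the two-word stop list are made up for illustration, and the regular expression is only a stand-in for the tokenizer that unnest_tokens() actually uses.

```r
# Hypothetical sentence and a tiny hand-picked stop-word list
# (not the stop_words lexicon that ships with tidytext)
sentence <- "The Ghosts, and the ghost's chains, frightened the GHOSTS."
toy_stop_words <- c("the", "and")

# Lowercase, then drop everything except letters, apostrophes, and spaces
cleaned <- gsub("[^a-z' ]", "", tolower(sentence))

# Split on whitespace to get one token per element
tokens <- strsplit(cleaned, " +")[[1]]

# Remove the stop words, analogous to anti_join(spooky_word, stop_words)
kept <- tokens[!tokens %in% toy_stop_words]
print(kept)
```

In this sketch “Ghosts,” and “GHOSTS.” collapse to the same token, while “ghost's” stays separate, which is why character names and inflected forms show up as distinct words later on.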

Using Tidyverse to Analyze Spooky Data

Using some of the methods proposed in Text Mining with R: A Tidy Approach, this section presents analysis on negation, frequency of words, length of sentences in terms of words and characters, and punctuation. By comparing the authors along these dimensions, I can subsequently build a predictive model to match each sentence to an author. Because the authors are all writing horror stories, it is important to capture the grammatical and stylistic differences between Edgar Allan Poe, HP Lovecraft, and Mary Shelley.

Word Frequency

Let’s see which words are the most common across all of the authors, after eliminating the “stop_words,” which are words that do not provide much context about the sentence (such as articles, conjunctions, and prepositions).

# Count each word once; `words` holds the words and `freqs` their frequencies
word_counts <- count(spooky_word, word)
words <- word_counts$word
freqs <- word_counts$n

head(sort(freqs, decreasing = TRUE))
## [1] 729 563 559 559 540 516
pal <- brewer.pal(8, "Set2")

#png("../figs/Wordcloud_all.png")
wordcloud(words, freqs, scale = c(5, .2), min.freq = 3, max.words = 50, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off()

From this wordcloud, we see that “time” is the word used most frequently across all authors. We also see that “life,” “night,” and “found” are in the next category of most frequently used words. In terms of horror fiction, these words seem to fit into what I would imagine as important words for establishing the setting and overall story. The next level of frequency (in green) further paints the scene for horror fiction, with words like “death,” “mind,” and “heard.” I also notice words like “horror,” “strange,” “black,” “fear,” and “spirit” that further contribute to the context of horror fiction.

However, this wordcloud cannot tell us whether one author uses a particular word so often that it masks how rarely the other two authors use it.

Word Frequency for Each Author

Next, I look at the most frequent words of each author. These wordclouds are compared with wordclouds of negated words later on.

# Make a table with one word per row and remove `stop words` (i.e. the common words)
spooky_word <- unnest_tokens(spooky, word, text)
spooky_word <- anti_join(spooky_word, stop_words, by = "word")

# MWS
mws_spooky <- spooky_word %>% 
   filter(author == "MWS")

mws.words1 <- count(group_by(mws_spooky, word))$word
mws.freqs1 <- count(group_by(mws_spooky, word))$n

#png("../figs/Wordcloud_mws.png")
wordcloud(mws.words1, mws.freqs1, scale = c(5, .2), min.freq = 3, max.words = 50, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off()

# HPL 
hpl_spooky <- spooky_word %>% 
   filter(author == "HPL")

hpl.words1 <- count(group_by(hpl_spooky, word))$word
hpl.freqs1 <- count(group_by(hpl_spooky, word))$n

#png("../figs/Wordcloud_hpl.png")
wordcloud(hpl.words1, hpl.freqs1, scale = c(5, .2), min.freq = 3, max.words = 50, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off()

# EAP
eap_spooky <- spooky_word %>% 
   filter(author == "EAP")

eap.words1 <- count(group_by(eap_spooky, word))$word
eap.freqs1 <- count(group_by(eap_spooky, word))$n

#png("../figs/Wordcloud_eap.png")
wordcloud(eap.words1, eap.freqs1, scale = c(5, .2), min.freq = 3, max.words = 50, random.order = FALSE, rot.per = .15, colors = pal)
## Warning in wordcloud(eap.words1, eap.freqs1, scale = c(5, 0.2), min.freq =
## 3, : question could not be fit on page. It will not be plotted.
## Warning in wordcloud(eap.words1, eap.freqs1, scale = c(5, 0.2), min.freq =
## 3, : altogether could not be fit on page. It will not be plotted.

#dev.off()

When looking at the frequency of words by each author, it appears that the use of “time” is dominated by Poe and Lovecraft, whereas Shelley uses “time” at a lower frequency. These wordclouds help us understand how each author differs from the others. Mary Shelley stands out with words like “life,” “heart,” and “love,” which Poe and Lovecraft use less frequently or not at all. The wordclouds also provide clues about the names of characters, which are an easy way to identify an author’s text. Here Shelley is again an outlier: “Raymond” and “Perdita” are character names, whereas Poe and Lovecraft do not use character names as often. It is possible that EAP and HPL prefer to use pronouns to refer to characters, and these are removed when we use “stop_words.”

# Counts number of times each author used each word.
author_words <- count(group_by(spooky_word, word, author))

# Counts number of times each word was used.
all_words <- rename(count(group_by(spooky_word, word)), all = n)

author_words <- left_join(author_words, all_words, by = "word")
author_words <- arrange(author_words, desc(all))
author_words <- ungroup(head(author_words, 81))

#png("../figs/author_words.png")
ggplot(author_words) +
  geom_col(aes(reorder(word, all, FUN = min), n, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

#dev.off()

From the wordclouds, we know that some authors favor certain words over others, but it is also important to understand whether these words are outliers in the entire corpus. From the above figure, we see that Mary Shelley does use “life,” “love,” and “heart” more often than the other two authors (which confirms our assumptions above). We can also see that HPL has no entry for “heart” or “love” among the rows plotted, and instead prefers words like “night” and “time.” MWS also tends to use “death” more than the other two authors. It is not clear whether this is because she refers to death more often or because HPL and EAP use different words and statements to refer to death.

Visualizations of Negative Bigrams

It’s possible that the “stop_words” are eliminating some of the important context clues that the authors provide to their respective readers. In this section, I look at the use of negation, specifically “not” and how this helps us understand the themes of each author. There is a clear difference between the two phrases: “she was living” and “she was not living,” and it is important for us to understand the difference and how each author uses negation to convey points.
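
As a toy illustration of the idea before running it on the corpus, the sketch below builds bigrams from one made-up sentence in base R and keeps the words that directly follow “not.” It only approximates what unnest_tokens(..., token = "ngrams", n = 2) followed by a filter on the first word does.

```r
# Hypothetical sentence, used only to illustrate the idea
sentence <- "she was not living and she did not love the night"

# One token per word
tokens <- strsplit(tolower(sentence), " +")[[1]]

# Pair each word with the word that follows it
bigrams <- data.frame(
  word1 = head(tokens, -1),
  word2 = tail(tokens, -1),
  stringsAsFactors = FALSE
)

# Keep only the words that directly follow "not"
negated <- bigrams$word2[bigrams$word1 == "not"]
print(negated)
```

A plain word count would report “living” and “love” here as positive-sounding words, while the bigram view shows that both are negated.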

# Subset data to MWS only and make bigrams
mws.bigrams <- spooky %>% 
  filter(author == "MWS") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

mws.bigrams %>% 
  count(bigram, sort = TRUE)
# Split bigram into one word per column
mws.split <- mws.bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter bigrams by "not" 
stop_not <- stop_words %>% 
  filter(word == "not")
stop_not <- anti_join(stop_words, stop_not, by ="word")

mws.filtered <- mws.split %>% 
  filter(word1 == "not")

# Remove stop_words except for not
mws.filtered$word <- mws.filtered$word2
mws <- anti_join(mws.filtered, stop_not, by = "word")

# mws.words is a list of words, and mws.freqs their frequencies
mws.words <- count(group_by(mws, word2))$word2
mws.freqs <- count(group_by(mws, word2))$n

#png("../figs/Wordcloud_mws_not.png")
wordcloud(mws.words, mws.freqs, scale = c(5, .2), min.freq = 3, max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off() 

When looking at the negated words for Shelley, it is interesting that “die,” or rather “not die,” is the most frequent, given that her previous wordcloud shows “life” as important for her literary style. It is also interesting that “love” is again prominent, this time as “not love.” Although Shelley appears to be more positive than the other two authors in the word frequency comparison above, this analysis suggests that she may often be negating positive words such as “love,” “hope,” and “feel.”

Replicating the above analysis for HPL

# Subset data to HPL only and make bigrams
hpl.bigrams <- spooky %>% 
  filter(author == "HPL") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

hpl.bigrams %>% 
  count(bigram, sort = TRUE)
# Split bigram into one word per column
hpl.split <- hpl.bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter bigrams by "not" 
hpl.filtered <- hpl.split %>% 
  filter(word1 == "not")

# Remove stop_words except for not
hpl.filtered$word <- hpl.filtered$word2
hpl <- anti_join(hpl.filtered, stop_not, by = "word")

# hpl.words is a list of words, and hpl.freqs their frequencies
hpl.words <- count(group_by(hpl, word2))$word2
hpl.freqs <- count(group_by(hpl, word2))$n

#png("../figs/Wordcloud_hpl_not.png")
wordcloud(hpl.words, hpl.freqs, scale = c(5, .2), min.freq = 3, max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off()

Unlike MWS, HPL’s negated frequency does not provide much more context about his style of writing and these are not words that one would typically associate with horror fiction.

Replicating the above analysis for EAP

# Subset data to EAP only and make bigrams
eap.bigrams <- spooky %>% 
  filter(author == "EAP") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

eap.bigrams %>% 
  count(bigram, sort = TRUE)
# Split bigram into one word per column
eap.split <- eap.bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter bigrams by "not" 
eap.filtered <- eap.split %>% 
  filter(word1 == "not")

# Remove stop_words except for not
eap.filtered$word <- eap.filtered$word2
eap <- anti_join(eap.filtered, stop_not, by = "word")

# eap.words is a list of words, and eap.freqs their frequencies
eap.words <- count(group_by(eap, word2))$word2
eap.freqs <- count(group_by(eap, word2))$n

#png("../figs/Wordcloud_eap_not.png")
wordcloud(eap.words, eap.freqs, scale = c(5, .2), max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal)

#dev.off()

Again, EAP’s negated frequency does not provide much context, but one could hypothesize that his characters might be more on a mission than the other authors. EAP uses phrases like “not fail” and “not impossible,” which may allude to some kind of adventure that the characters must embark on.

Frequency with which each author uses negative words

stop_no <- stop_words %>% 
  filter(word %in% c("neither", "never", "no", "nobody", "non", "none", "noone", "not", "nothing", "nowhere"))

stop_neg <- anti_join(stop_words, stop_no, by = "word")

spooky_neg <- spooky %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_neg, by = "word")

author_neg <- count(group_by(spooky_neg, word, author))
all_words_neg <- rename(count(group_by(spooky_neg, word)), all = n)

author_neg <- left_join(author_neg, all_words_neg, by = "word")
author_neg <- arrange(author_neg, desc(all))
author_neg <- ungroup(head(author_neg, 100))

#png("../figs/author_neg.png")
ggplot(author_neg) +
  geom_col(aes(reorder(word, all, FUN = min), n, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

#dev.off()

As above, I compare word frequencies across authors, but this time I keep “not” and other negation words to see how often each author uses negation. We see that EAP and MWS use negation much more than HPL, which helps to explain why HPL’s wordcloud was less informative about his style of writing.

Punctuation

Similar to investigating negation to understand the style of each author, there may be some clues in how each author uses punctuation in their sentences. Semicolons, colons, and commas are often clues to the reader about where the sentence is headed. In this section, we look at the use of semicolons, colons, and commas by each author.

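# The plots in this section need per-sentence punctuation counts. Their original
# definitions are not shown, so the three lines below are an assumption about
# how Ncommas, Nsemicolons, and Ncolons were computed.
spooky$Ncommas     <- str_count(spooky$text, ",")
spooky$Nsemicolons <- str_count(spooky$text, ";")
spooky$Ncolons     <- str_count(spooky$text, ":")

#png("../figs/commas.png")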
spooky %>%
  group_by(author) %>%
  summarise(SumCommas = sum(Ncommas,na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(author = reorder(author,SumCommas)) %>%
  
  ggplot(aes(x = author,y = SumCommas)) +
  geom_bar(stat='identity',colour="white", fill = "slategray2") +
  labs(x = 'Author', 
       y = 'Commas', 
       title = 'Author and Total Number of Commas') +
  coord_flip() + 
  theme_bw()

#dev.off()

EAP uses the most commas among the three authors, nearly double the number HPL uses. These are totals, and EAP also contributes the most sentences to the corpus, so part of the gap reflects corpus size. Still, this is interesting because we know from above that HPL tends to have longer sentences (median length around 75-100 characters in each sentence), yet he does not appear to use commas to separate out clauses. EAP, on the other hand, has more variability in his sentence lengths and appears to be comma heavy. Since commas and other forms of punctuation are “pauses” for the reader, maybe EAP uses these tools to slow down the reader and build suspense in his horror fiction.

#png("../figs/semicolons.png")
spooky %>%
  group_by(author) %>%
  summarise(SumSemiColons = sum(Nsemicolons,na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(author = reorder(author,SumSemiColons)) %>%
  
  ggplot(aes(x = author,y = SumSemiColons)) +
  geom_bar(stat='identity',colour="white", fill = "darkseagreen4") +
  labs(x = 'Author', 
       y = 'Semicolons', 
       title = 'Author and Total Number of Semicolons') +
  coord_flip() + 
  theme_bw()

#dev.off()

Semicolons are less frequently used than commas by all authors (likely because their grammatical purpose can be captured by “, + conjunction”), but Shelley appears to favor the use of semicolons in her literature. Again, HPL does not use many semicolons and may prefer other styles of leading the reader.

#png("../figs/colons.png")
spooky %>%
  group_by(author) %>%
  summarise(SumColons = sum(Ncolons,na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(author = reorder(author,SumColons)) %>%
  
  ggplot(aes(x = author,y = SumColons)) +
  geom_bar(stat='identity',colour="white", fill = "antiquewhite2") +
  labs(x = 'Author', 
       y = 'Number of colons', 
       title = 'Author and Total Number of Colons') +
  coord_flip() + 
  theme_bw()

#dev.off()

All three authors use far fewer colons than commas or semicolons. Whereas EAP used close to 15,000 commas in his corpus, he uses fewer than 200 colons. MWS also uses fewer colons than semicolons. And unsurprisingly, HPL uses even fewer colons (fewer than 50) in his entire corpus. Understanding how each author uses punctuation appears to be helpful for author identification. If we see a sentence that lacks commas, colons, or semicolons, it is more likely to be HP Lovecraft, whereas comma-heavy sentences are more predictive of Edgar Allan Poe.
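
Because the authors contribute different numbers of sentences, totals can overstate the differences. A possible complement is to look at the mean count per sentence. The sketch below does this in base R on a made-up four-sentence corpus; the helper `count_char` and the toy data are hypothetical, and the column names simply mirror those used in the plots above.

```r
# Hypothetical mini-corpus standing in for spooky$text and spooky$author
toy <- data.frame(
  author = c("EAP", "EAP", "HPL", "MWS"),
  text = c("Yes, yes; I heard it, too.",
           "Then, silence: nothing more.",
           "The thing came from beyond the stars.",
           "I wept; I hoped; I lived, alone."),
  stringsAsFactors = FALSE
)

# Count occurrences of a literal character in each string
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

toy$Ncommas     <- count_char(toy$text, ",")
toy$Nsemicolons <- count_char(toy$text, ";")
toy$Ncolons     <- count_char(toy$text, ":")

# Mean count per sentence, so authors with more sentences are not favoured
per_sentence <- aggregate(cbind(Ncommas, Nsemicolons, Ncolons) ~ author,
                          data = toy, FUN = mean)
print(per_sentence)
```

Per-sentence rates like these are also closer to what a classifier would see, since it has to label one sentence at a time.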

TF-IDF for each author

Instead of looking at raw word frequency (as we did above), I now turn to IDF, or inverse document frequency. IDF weights each word by how rare it is across documents (here, each author’s text is treated as one document): words used by every author get a low weight, and words used by only one author get a high weight. When it is combined with term frequency, we can analyze the frequency of a term after it is weighted for its rarity in the corpus.
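
To make the weighting concrete, here is a small hand computation in base R on made-up counts. It treats each author as one document and uses the natural log for IDF, which matches my understanding of what bind_tf_idf() computes; the words and counts are hypothetical.

```r
# Hypothetical word counts: one row per (author, word) pair
counts <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS", "MWS"),
  word   = c("time", "monsieur", "time", "eldritch", "time", "raymond"),
  n      = c(6, 2, 4, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Term frequency: share of an author's words accounted for by each word
counts$tf <- counts$n / ave(counts$n, counts$author, FUN = sum)

# Inverse document frequency: log(number of authors / authors using the word)
n_authors <- length(unique(counts$author))
authors_using <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_authors / authors_using)

# TF-IDF is the product of the two
counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

In this toy example “time” is used by all three authors, so its IDF and TF-IDF are zero no matter how often it appears, while words unique to one author rise to the top.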

frequency <- count(spooky_word, author, word)
tf_idf <- bind_tf_idf(frequency, word, author, n)
head(tf_idf)
tail(tf_idf)
tf_idf <- arrange(tf_idf, desc(tf_idf))
tf_idf <- mutate(tf_idf, word = factor(word, levels = rev(unique(word))))

# Grab the top twenty tf_idf scores in all the words for each author
tf_idf <- ungroup(top_n(group_by(tf_idf, author), 20, tf_idf))

#png("../figs/tf_idf.png")
ggplot(tf_idf) +
  geom_col(aes(word, tf_idf, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  labs(x = NULL, y = "tf-idf") +
  theme(legend.position = "none") +
  facet_wrap(~ author, ncol = 3, scales = "free") +
  coord_flip() +
  labs(y = "TF-IDF values")

#dev.off()

Unlike the wordclouds and frequencies above, the TF-IDF ranking brings character names to the top for all of the authors. MWS has the most character names in her TF-IDF, EAP uses more foreign words (mostly French), and HPL has more “folky” words such as “folk,” “beard,” and “legends.”

More TF-IDF

spooky_tf <- spooky %>% 
  unnest_tokens(word, text) %>% 
  count(author, word, sort = TRUE) %>% 
  ungroup()

total_words <- spooky_tf %>% 
  group_by(author) %>% 
  summarize(total = sum(n))

spooky_tf <- left_join(spooky_tf, total_words, by = "author")

#png("../figs/spooky_tf.png")
ggplot(spooky_tf, aes(n/total, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  geom_histogram(show.legend = FALSE) +
  stat_bin(binwidth = 0.00025) +
  xlim(NA, 0.0009) +
  facet_wrap(~author, ncol = 3, scales = "free_y") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 345 rows containing non-finite values (stat_bin).

## Warning: Removed 345 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).

#dev.off()
freq_by_rank <- spooky_tf %>% 
  group_by(author) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

#png("../figs/zipf.png")
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = author)) +
  geom_abline(intercept = -0.62, slope = -1.1, color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()

#dev.off()

The term-frequency figures above illustrate that all of the authors use a few words very often and most words rarely. This is typical of text data, and we see from the figure that our corpus roughly follows Zipf’s law, which states that a word’s frequency is inversely proportional to its rank. On log-log axes this corresponds to a line with a slope near -1; the dashed gray reference line has a slope of -1.1.
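
One way to check the slope numerically, rather than by eye, is to regress log frequency on log rank. The sketch below does this on synthetic counts that follow an exact power law, so the data are illustrative only; on the real data the same lm() call could be applied to the `rank` and `term frequency` columns of freq_by_rank.

```r
# Synthetic counts that follow an exact power law, for illustration only
rank <- 1:500
n <- 10000 / rank            # frequency inversely proportional to rank
term_frequency <- n / sum(n)

# Fit a line on log-log axes; the slope is the Zipf exponent
fit <- lm(log10(term_frequency) ~ log10(rank))
zipf_slope <- unname(coef(fit)[2])
print(zipf_slope)
```

A fitted slope on the real corpus would be expected to deviate somewhat from -1, especially at the very highest and lowest ranks.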

Sentiment lexicon

In the above analysis, we have attempted to understand how each author uses certain words and punctuation, but we have yet to adequately evaluate the sentiments or contexts of the “spooky” corpus. In this section, I use different sentiment lexicons to understand the feelings behind each author. Is Mary Shelley actually more positive in using words like “love” and “life?”

# Keep words that have been classified within the NRC lexicon.
get_sentiments('nrc')
sentiments <- inner_join(spooky_word, get_sentiments('nrc'), by = "word")

count(sentiments, sentiment)
#Record frequency of each sentiment/emotion for each author
count(sentiments, author, sentiment)
#png("../figs/sentiments.png")
ggplot(count(sentiments, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) +
  scale_fill_manual( values = c("antiquewhite3", "honeydew4", "lavenderblush4", "lemonchiffon2",  "lightsteelblue3",  "lightgoldenrod3", "rosybrown3", "mistyrose3",  "slategray2", "peachpuff3"))

#dev.off()

#png("../figs/sentiments_author.png")
ggplot(count(sentiments, author, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) + 
  scale_fill_manual( values = c("antiquewhite3", "honeydew4", "lavenderblush4", "lemonchiffon2",  "lightsteelblue3",  "lightgoldenrod3", "rosybrown3", "mistyrose3",  "slategray2", "peachpuff3")) +
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

#dev.off()

In the entire corpus, we see that there are slightly more positive than negative words used. In addition to the positive and negative categories, the NRC lexicon tags words with eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. In our horror fiction corpus, fear, trust, sadness, and anticipation are the most common emotions.

When looking at the sentiments of each author, we can somewhat confirm our initial conjecture that MWS is more positive than the other two authors. Although EAP is more positive than negative and only slightly less positive than MWS, HPL is much more negative than he is positive. HPL has fewer words that are classified as “anticipation,” which is interesting since he also does not use punctuation to build anticipation among his readers. However, the lexicon is crowd-sourced, and it is possible that HPL uses words that were not categorized in this particular lexicon.

How does this compare to the “negative” sentiments?

nrc_neg <- filter(get_sentiments('nrc'), sentiment == "negative")
nrc_neg
negative <- inner_join(spooky_neg, nrc_neg, by = "word")
head(negative)
count(negative, word, sort = TRUE)
neg_words <- count(group_by(negative, word, author))
neg_words_all <- count(group_by(negative, word))

neg_words <- neg_words %>% 
  left_join(neg_words_all, by = "word") %>% 
  arrange(desc(n.y)) %>% 
  head(100) %>% 
  ungroup()

#png("../figs/neg_words.png")
ggplot(neg_words) + 
  geom_col(aes(reorder(word, n.y, FUN = min), n.x, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~author) +
  theme(legend.position = "none")

#dev.off()

Continuing with the negative sentiment words, we see that MWS is more likely to use “death” in her repertoire compared to EAP and HPL. Additionally, HPL almost never uses “misery” or “despair,” whereas MWS favors these two words. HPL is more likely to use words like “ancient” and “hideous” to create negative feelings. We can begin to see a problem with this lexicon, as “mother” is associated with a negative sentiment.

What is left out of the sentiment analysis?

Maybe the above lexicon is not the best choice for sentiment analysis, since the crowdsourcing occurred in the 2010s and the authors were writing in the 19th and 20th centuries. Word usage changes over time, so we may be missing part of the picture.

sentiments.out <- anti_join(spooky_word, get_sentiments('nrc'), by = "word")

frequency_sent <- count(sentiments.out, author, word)
tf_idf_sent <- bind_tf_idf(frequency_sent, word, author, n)
head(tf_idf_sent)
tail(tf_idf_sent)
tf_idf_sent <- arrange(tf_idf_sent, desc(tf_idf))
tf_idf_sent <- mutate(tf_idf_sent, word = factor(word, levels = rev(unique(word))))

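# Take the top twenty scores per author from tf_idf_sent (the original line
# grouped tf_idf, which reproduced the earlier TF-IDF plot)
tf_idf_sent <- ungroup(top_n(group_by(tf_idf_sent, author), 20, tf_idf))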

#png("../figs/tf_idf_sent.png")
ggplot(tf_idf_sent) +
  geom_col(aes(word, tf_idf, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  labs(x = NULL, y = "tf-idf-sentiments") +
  theme(legend.position = "none") +
  facet_wrap(~ author, ncol = 3, scales = "free") +
  coord_flip() +
  labs(y = "TF-IDF of Missing Sentiments")

#dev.off()

However, when we look at the missing sentiments of the corpus, we mostly see that the character names are not categorized. This does not provide as much information about the sentiments as we would have hoped.

How does this compare with other sentiment lexicons?

Let’s take a look at other sentiment lexicons, like AFINN and Bing.

sentiment.afinn <- spooky_word %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(index = author) %>% 
  summarise(sentiment = sum(score)) %>% 
  mutate(method = "AFINN")

sentiment.bing.nrc <- bind_rows(spooky_word %>% 
                                  inner_join(get_sentiments("bing")) %>% 
                                  mutate(method = "Bing et al."),
                                spooky_word %>% 
                                  inner_join(get_sentiments("nrc") %>% 
                                  filter(sentiment %in% c("positive", "negative"))) %>% 
                                  mutate(method = "NRC")) %>% 
  count(method, index = author, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"

Bing is similar to the NRC lexicon in that it labels words in binary form as positive or negative, but the two lexicons may label individual words differently. The AFINN lexicon assigns scores to words ranging from -5 (negative) to 5 (positive). This will allow a more detailed understanding of the “positiveness” or “negativeness” of each author. For Bing and NRC, the estimate of net sentiment is the number of positive words minus the number of negative words; for AFINN, it is the sum of the scores.
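
The difference between the two kinds of lexicon can be sketched with a few made-up entries. The tokens, labels, and scores below are hypothetical and are not taken from the real Bing or AFINN lexicons; the point is only to show how the two net sentiment estimates are formed.

```r
# Hypothetical tokens and hand-made lexicons (not the real Bing or AFINN entries)
tokens <- c("love", "death", "horror", "hope", "night", "death")

binary_lexicon <- c(love = "positive", hope = "positive",
                    death = "negative", horror = "negative")
scored_lexicon <- c(love = 3, hope = 2, death = -2, horror = -3)

# Binary approach: count positive and negative matches, then subtract
labels <- binary_lexicon[tokens]            # NA for words missing from the lexicon
net_binary <- sum(labels == "positive", na.rm = TRUE) -
  sum(labels == "negative", na.rm = TRUE)

# Scored approach: add up the scores of the matched words
net_scored <- sum(scored_lexicon[tokens], na.rm = TRUE)

print(c(binary = net_binary, scored = net_scored))
```

Words missing from a lexicon (here “night”) contribute nothing to either estimate, which is one reason the lexicons can disagree on the same text.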

#png("../figs/sentiment3.png")
bind_rows(sentiment.afinn, sentiment.bing.nrc) %>% 
  ggplot(aes(index, sentiment, fill = method)) + 
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  geom_col(show.legend = FALSE) + 
  facet_wrap(~method, ncol =1, scales = "free_y")

#dev.off()

The lexicons appear to classify the authors in similar manners (at least directionally). All three illustrate that HPL is the most negative of the three authors, and MWS and EAP are somewhere in between negative and positive. It is important to note the difference in scales. While AFINN and Bing range from -4000 to 0, NRC ranges from -1000 to 1000.

More with bigrams

Although we have thoroughly evaluated the word frequency of each author, it may be helpful to take a final glance at pairs of words for each author. This is similar in nature to the negation analysis above, but I will filter out all “stop_words.”

spooky_bigrams <- spooky %>% 
  select(author, text) %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) 

# Split each bigram on the space; `sep` is an argument of separate(), not part of the column names
bigrams_separated <- spooky_bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigrams_united <- bigrams_filtered %>% 
  unite(bigram, word1, word2, sep = " ")

bigram_tf_idf <- bigrams_united %>% 
  count(author, bigram) %>% 
  bind_tf_idf(bigram, author, n) %>% 
  arrange(desc(tf_idf))

#png("../figs/bigram_tf_idf.png")
bigram_tf_idf  %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
  group_by(author) %>%
  top_n(20, tf_idf) %>%
  ungroup() %>%  
  ggplot(aes(bigram, tf_idf, fill = author)) +
  scale_fill_manual( values = c("lightsteelblue3","lemonchiffon2","honeydew4")) +
  geom_col() +
  labs(x = NULL, y = "TF-IDF values") +
  theme(legend.position = "none") +
  facet_wrap(~ author, ncol = 3, scales = "free") +
  coord_flip()

#dev.off()

The phrases most used by EAP and HPL are “ha ha” and “heh heh” respectively. Although HPL was categorized as more negative, he uses laughter (possibly in a rude way) often. EAP, again, uses more French phrases in his corpus than the other two authors. With MWS, we again see that she uses names of characters and places as well as “poor wretch/girl” frequently.

Concluding Remarks

In this text analysis, investigating punctuation and character names, as well as sentiment, appears to be the most promising source of author identification. We can easily classify EAP with shorter sentences and French words, whereas HPL can be categorized by more nature-themed phrases, and MWS by her use of love and positive phrases.

Much of the above work was adapted from the following posts:
https://www.kaggle.com/headsortails/treemap-house-of-horror-spooky-eda-lda-features
https://www.kaggle.com/ambarish/tutorial-detailed-spooky-fun-eda-and-modelling
https://www.tidytextmining.com/ngrams.html